In the first regression analysis, we found that several factors contribute significantly to the number of wins, especially factors related to three-point shooting. In this section, we continue to explore the relationships between factors of all kinds and the score made in each game. Using data from every NBA game in the last 20 years, we fit linear and logistic regressions to analyze which factors might affect a game's score and result, and to find out what the New York Knicks need to improve to achieve higher scores.
knitr::opts_chunk$set(
fig.width = 6,
fig.asp = .6,
out.width = "90%"
)
theme_set(theme_minimal() + theme(legend.position = "bottom"))
options(
ggplot2.continuous.colour = "viridis",
ggplot2.continuous.fill = "viridis"
)
scale_colour_discrete = scale_colour_viridis_d
scale_fill_discrete = scale_fill_viridis_d
box_score_all = read_csv("./data2/box_score_all.csv")
## Rows: 47830 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): season_year, team_abbreviation, team_name, game_id, matchup, wl
## dbl (23): team_id, min, fgm, fga, fg_pct, fg3m, fg3a, fg3_pct, ftm, fta, ft...
## dttm (1): game_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The box_score_all dataset contains data on 47830 games from the 2001-02 season to the 2020-21 season and includes 30 variables. We will use some of them in the following exploratory data analysis.
We first drop the variables that are not applicable for regression analysis, such as season_year, team_id, and the rank variables, and keep the reasonable variables in a data set called "regre_df".
To build the logistic regression model, we recode win/loss as 1/0. We then use regre_df to draw boxplots of every variable, comparing the characteristics of winning and losing teams.
regre_df =
box_score_all %>%
select(-c(1:7)) %>%
select(-ends_with("rank")) %>%
mutate(wl = recode(wl, "W" = 1, "L" = 0),
wl = as.factor(wl))
In this part, we explore each game in the past 20 years and try to find important variables that might affect the result of the game. This process also gives us insight when choosing potential parameters for model building.
First, we look at the difference in scores between winning and losing teams over the past 20 years.
We can see that the score a team needs to win has become much higher compared to past years. The average score for the winning team went up and down from the 2001-02 to the 2014-15 seasons; however, since the small-ball revolution began in the 2015-16 season, the average winning score has kept increasing and never fallen back.
So it is obvious that if a team wants to win a game, it needs new techniques to score more points. Next we explore some factors we think might play a role in the result of the game.
lose_game =
  box_score_all %>%
  filter(wl == "L")
box_score_all %>%
  filter(wl == "W") %>%
  ggplot(aes(x = pts, y = season_year)) +
  geom_density_ridges(scale = .8, alpha = .5, fill = "blue",
                      quantile_lines = TRUE, quantile_fun = mean) +
  geom_density_ridges(data = lose_game, aes(x = pts, y = season_year),
                      scale = .8, alpha = .5, fill = "salmon",
                      quantile_lines = TRUE, quantile_fun = mean) +
  xlim(65, 140) +
  labs(x = "Scores",
       y = "Season Year",
       title = "Score Distribution of Win and Lose Games")
## Picking joint bandwidth of 2.24
## Warning: Removed 121 rows containing non-finite values (stat_density_ridges).
## Picking joint bandwidth of 2.31
## Warning: Removed 81 rows containing non-finite values (stat_density_ridges).

Since a team needs to score more to win, we first look at variables that directly influence the score. We put the plots of percentage and attempts together, so we can observe the NBA's trend in scoring strategy over these 20 years.
First, although field goal attempts and percentage did not seem to change much over these 20 years, the winning team has a much stronger field goal percentage compared to the losing team. We can also see that the losing team attempts slightly more field goals than the winning team.
Second, the 3-point field goal percentage did not change much in these two decades. However, there is a significant increase in 3-point field goal attempts. After the small-ball era began around 2015, 3-point attempts grew remarkably; and, as with field goal attempts, the losing team also attempts more 3-pointers.
Third, free throw attempts and percentage did not change much over these 20 years. The winning team has more attempts and a higher percentage.
By inspecting these variables, we can conclude that the game has almost become a 3-point shooting contest after 2015: every team takes as many 3-point shots as it can. The most notable difference in attempts between winning and losing teams appears in free throw attempts; even though a free throw contributes only one point, it is still an indicator of the game result. The most outstanding difference in percentage appears in field goal percentage, which suggests that improving field goal percentage is one of the most critical things a team should consider if it wants to win.
Since high 3-point attempt counts have become the trend, we looked up the three highest 3-point-attempt outliers on the plot and found they were all produced by the Houston Rockets. Looking deeper, among the top ten highest 3-point-attempt games in these 20 years, the Houston Rockets account for 8, and the other 2 belong to the Atlanta Hawks.
plot_ly( box_score_all, x = ~ season_year, y = ~ fga , color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Field Goal Attempted"))
plot_ly( box_score_all, x = ~ season_year, y = ~ fg_pct, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Field Goal Percentage"))
plot_ly(box_score_all, x = ~ season_year, y = ~ fg3a, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "3 Point Field Goals Attempted"))
plot_ly(box_score_all, x = ~ season_year, y = ~ fg3_pct, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "3 Point Field Goals Percentage"))
plot_ly(box_score_all, x = ~ season_year, y = ~ fta, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Free Throw Attempted"))
plot_ly(box_score_all, x = ~ season_year, y = ~ ft_pct, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Free Throw Percent"))
nyk=
box_score_all %>%
filter(team_abbreviation =="NYK")
plot_ly(nyk, x = ~ season_year, y = ~ fg3a, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "3 Point Field Goals Attempted"))
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'
plot_ly(nyk, x = ~ season_year, y = ~ fg3_pct, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "3 Point Field Goals Percentage"))
Next, we explore the influence of some offensive strategies, to see what kinds of techniques might play a role in the result of the game.
Average offensive rebounds are slightly higher for the losing team, while average assists per game are significantly higher for the winning team.
That offensive rebounds are higher for the losing team follows the same pattern as shot attempts being higher for the losing team. We can hypothesize that when a team starts to lose, it takes more aggressive strategies than the team that keeps leading.
Average assists per game being significantly higher for the winning team may follow from the definition of an assist, which is only counted when the pass leads to a successful score; more scoring means a better chance of winning.
offensive_df =
  box_score_all %>%
  select(season_year, wl, oreb, ast)
offensive_df %>%
  group_by(season_year, wl) %>%
  summarise(oreb = mean(oreb)) %>%
  plot_ly(x = ~ season_year, y = ~ oreb, type = "bar",
          color = ~wl, colors = "viridis") %>%
  layout(barmode = "group",
         xaxis = list(title = 'Season Year'),
         yaxis = list(title = "Average offensive rebounds of each game"))
## `summarise()` has grouped output by 'season_year'. You can override using the `.groups` argument.
offensive_df %>%
group_by(season_year, wl) %>%
summarise(ast= mean(ast)) %>%
plot_ly(x = ~ season_year, y = ~ ast, type = "bar",
color = ~wl, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average assists of each game"))
## `summarise()` has grouped output by 'season_year'. You can override using the `.groups` argument.
In this part, we explore the influence of some defensive strategies, to see what kinds of defensive techniques might play a role in the result of the game.
Steals, blocks, and defensive rebounds per game are significantly higher for the winning team; personal fouls and turnovers per game are slightly higher for the losing team.
box_score_all %>%
group_by(season_year, wl) %>%
summarise(stl= mean(stl)) %>%
plot_ly(x = ~ season_year, y = ~ stl, type = "bar",
color = ~wl, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average steals of each game"))
## `summarise()` has grouped output by 'season_year'. You can override using the `.groups` argument.
box_score_all %>%
group_by(season_year, wl) %>%
summarise(blk= mean(blk)) %>%
plot_ly(x = ~ season_year, y = ~ blk, type = "bar",
color = ~wl, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average blocks of each game"))
## `summarise()` has grouped output by 'season_year'. You can override using the `.groups` argument.
box_score_all %>%
group_by(season_year, wl) %>%
summarise(dreb= mean(dreb)) %>%
plot_ly(x = ~ season_year, y = ~ dreb, type = "bar",
color = ~wl, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average defensive rebounds of each game"))
## `summarise()` has grouped output by 'season_year'. You can override using the `.groups` argument.
box_score_all %>%
group_by(season_year, wl) %>%
summarise(tov= mean(tov)) %>%
plot_ly(x = ~ season_year, y = ~ tov, type = "bar",
color = ~wl, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average turnovers of each game"))
## `summarise()` has grouped output by 'season_year'. You can override using the `.groups` argument.
box_score_all %>%
group_by(season_year, wl) %>%
summarise(pf= mean(pf)) %>%
plot_ly(x = ~ season_year, y = ~ pf, type = "bar",
color = ~wl, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average personal fouls of each game"))
## `summarise()` has grouped output by 'season_year'. You can override using the `.groups` argument.
Further, we can draw a correlation map to examine the correlations among variables, which helps us select variables when building the model.
regre_df =
  regre_df %>%
  mutate(across(-wl, as.numeric))
corr <- cor(regre_df[-1])
corrplot(corr, method = "square", order = "FPC")

Based on this plot, we can see strong correlations between some variables, such as ftm and fta, or dreb and reb. For each pair of variables with an absolute correlation above 0.5, we keep only one of the two, so that the selected variables have no strong correlation with each other.
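As a sketch of how such pairs could be flagged programmatically, the snippet below lists every pair with absolute correlation above 0.5; it uses a few columns of the built-in mtcars data so it is self-contained, but the same idea applies directly to the corr matrix computed above.

```r
# List all variable pairs whose absolute correlation exceeds 0.5,
# demonstrated on a few columns of the built-in mtcars data.
cm <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])
idx <- which(abs(cm) > 0.5 & upper.tri(cm), arr.ind = TRUE)
data.frame(
  var1 = rownames(cm)[idx[, 1]],
  var2 = colnames(cm)[idx[, 2]],
  corr = round(cm[idx], 2)
)
```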
After ruling out variables with strong correlations, we include the variables as follows. The dependent variable is the score of each game, denoted by pts (points). The independent variables are selected from both the offensive aspect and the defensive aspect.
For the offensive level, variables include: field goal percentage (fg_pct), 3-point field goal percentage (fg3_pct), free throw percentage (ft_pct), offensive rebounds (oreb), and assists (ast).
As for the defensive level, variables include: defensive rebounds (dreb), steals (stl), blocks (blk), turnovers (tov), and personal fouls (pf).
We use the step function to choose a model by AIC with a stepwise algorithm.
ln_regre = lm(pts ~ fg_pct + fg3_pct + ft_pct + oreb + dreb + ast + stl + blk + tov + pf, data = regre_df)
summary(ln_regre)
##
## Call:
## lm(formula = pts ~ fg_pct + fg3_pct + ft_pct + oreb + dreb +
## ast + stl + blk + tov + pf, data = regre_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.617 -5.038 -0.342 4.710 46.223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.229423 0.515913 -48.90 <2e-16 ***
## fg_pct 138.594897 0.803307 172.53 <2e-16 ***
## fg3_pct 14.564180 0.325892 44.69 <2e-16 ***
## ft_pct 24.290515 0.331887 73.19 <2e-16 ***
## oreb 0.802525 0.009228 86.96 <2e-16 ***
## dreb 0.563889 0.006492 86.86 <2e-16 ***
## ast 0.487421 0.008136 59.91 <2e-16 ***
## stl 0.565725 0.011870 47.66 <2e-16 ***
## blk -0.152531 0.013351 -11.43 <2e-16 ***
## tov -0.682900 0.008899 -76.74 <2e-16 ***
## pf 0.404912 0.007667 52.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.334 on 47819 degrees of freedom
## Multiple R-squared: 0.6916, Adjusted R-squared: 0.6916
## F-statistic: 1.072e+04 on 10 and 47819 DF, p-value: < 2.2e-16
linear.step = step(ln_regre,direction="both")
## Start: AIC=190616
## pts ~ fg_pct + fg3_pct + ft_pct + oreb + dreb + ast + stl + blk +
## tov + pf
##
## Df Sum of Sq RSS AIC
## <none> 2572088 190616
## - blk 1 7021 2579109 190744
## - fg3_pct 1 107426 2679514 192571
## - stl 1 122170 2694258 192834
## - pf 1 150024 2722112 193325
## - ast 1 193050 2765138 194076
## - ft_pct 1 288123 2860211 195692
## - tov 1 316762 2888850 196169
## - dreb 1 405789 2977877 197621
## - oreb 1 406778 2978867 197637
## - fg_pct 1 1601094 4173182 213762
summary(linear.step)
##
## Call:
## lm(formula = pts ~ fg_pct + fg3_pct + ft_pct + oreb + dreb +
## ast + stl + blk + tov + pf, data = regre_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.617 -5.038 -0.342 4.710 46.223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.229423 0.515913 -48.90 <2e-16 ***
## fg_pct 138.594897 0.803307 172.53 <2e-16 ***
## fg3_pct 14.564180 0.325892 44.69 <2e-16 ***
## ft_pct 24.290515 0.331887 73.19 <2e-16 ***
## oreb 0.802525 0.009228 86.96 <2e-16 ***
## dreb 0.563889 0.006492 86.86 <2e-16 ***
## ast 0.487421 0.008136 59.91 <2e-16 ***
## stl 0.565725 0.011870 47.66 <2e-16 ***
## blk -0.152531 0.013351 -11.43 <2e-16 ***
## tov -0.682900 0.008899 -76.74 <2e-16 ***
## pf 0.404912 0.007667 52.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.334 on 47819 degrees of freedom
## Multiple R-squared: 0.6916, Adjusted R-squared: 0.6916
## F-statistic: 1.072e+04 on 10 and 47819 DF, p-value: < 2.2e-16
The adjusted R-squared for the full model is 0.6916; that is to say, 69.16% of the variance in the response variable can be explained by the predictors.
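For reference, the adjusted R-squared penalizes the ordinary R-squared for the number of predictors \(p\) relative to the sample size \(n\):
\[R^2_{adj}=1-(1-R^2)\frac{n-1}{n-p-1}\]
With n = 47830 and p = 10 the penalty is tiny, which is why the multiple and adjusted R-squared above agree to four decimal places.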
1) Check whether the error term is normally distributed with mean 0.
ggplot(data = ln_regre, aes(x = .resid)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Condition 1 is met.
2) Check whether the error term is independent of the fitted values (no systematic pattern in the residuals).
ggplot(data = ln_regre, aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Condition 2 is met as we cannot see an obvious tendency of errors.
According to model ln_regre, the fitted equation is shown below.
\[pts_i=-25.229423+138.594897(fg_pct)+14.564180(fg3_pct)+24.290515(ft_pct)+0.802525(oreb)+\\ 0.563889(dreb)+0.487421(ast)+0.565725(stl)-0.152531(blk)-0.682900(tov)+0.404912(pf)\]
All variables selected are significant in this linear regression model.
For each additional 0.1 in field goal percentage, the points will increase by about 13.9.
For each additional 0.1 in 3-point field goal percentage, the points will increase by about 1.46.
For each additional 0.1 in free throw percentage, the points will increase by about 2.43.
For each additional offensive rebound per game, the points will increase by about 0.80.
For each additional defensive rebound per game, the points will increase by about 0.56.
For each additional steal per game, the points will increase by about 0.57.
For each additional assist per game, the points will increase by about 0.49.
For each additional block per game, the points will decrease by about 0.15.
For each additional turnover per game, the points will decrease by about 0.68.
For each additional personal foul per game, the points will increase by about 0.40.
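To make these coefficient interpretations concrete, here is a sketch that plugs a hypothetical stat line (the values in x are chosen purely for illustration) into the coefficients of the fitted linear model above:

```r
# Coefficients from the fitted linear model reported above.
coefs <- c(intercept = -25.229423, fg_pct = 138.594897, fg3_pct = 14.564180,
           ft_pct = 24.290515, oreb = 0.802525, dreb = 0.563889,
           ast = 0.487421, stl = 0.565725, blk = -0.152531,
           tov = -0.682900, pf = 0.404912)
# Hypothetical per-game stat line (intercept term first, then values
# matching the coefficient order).
x <- c(1, 0.46, 0.36, 0.78, 10, 33, 24, 8, 5, 14, 20)
predicted_pts <- sum(coefs * x)
round(predicted_pts, 1)  # roughly 103 points
```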
We separate the data into 90% training data and 10% testing data for prediction.
set.seed(22)
train.index <- sample(x=1:nrow( regre_df), size=ceiling(0.9*nrow(regre_df)))
train = regre_df[train.index, ]
test =regre_df[-train.index, ]
We build a logistic regression model, then apply stepwise selection in both directions.
lg_regre <- glm(wl ~ fg_pct + fg3_pct + ft_pct + oreb + dreb + ast + stl + blk + tov + pf,
                data = train, family = "binomial", control = list(maxit = 1000))
summary(lg_regre)
##
## Call:
## glm(formula = wl ~ fg_pct + fg3_pct + ft_pct + oreb + dreb +
## ast + stl + blk + tov + pf, family = "binomial", data = train,
## control = list(maxit = 1000))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3302 -0.6406 0.0198 0.6309 3.2524
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -25.150274 0.282380 -89.06 <2e-16 ***
## fg_pct 30.300719 0.394694 76.77 <2e-16 ***
## fg3_pct 4.300476 0.132209 32.53 <2e-16 ***
## ft_pct 3.626952 0.133934 27.08 <2e-16 ***
## oreb 0.169106 0.003828 44.18 <2e-16 ***
## dreb 0.237543 0.003153 75.35 <2e-16 ***
## ast -0.040846 0.003216 -12.70 <2e-16 ***
## stl 0.263265 0.005102 51.60 <2e-16 ***
## blk 0.120397 0.005340 22.55 <2e-16 ***
## tov -0.177678 0.003778 -47.03 <2e-16 ***
## pf -0.067135 0.003050 -22.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59676 on 43046 degrees of freedom
## Residual deviance: 35814 on 43036 degrees of freedom
## AIC: 35836
##
## Number of Fisher Scoring iterations: 5
logit.step = step(lg_regre,direction="both")
## Start: AIC=35835.62
## wl ~ fg_pct + fg3_pct + ft_pct + oreb + dreb + ast + stl + blk +
## tov + pf
##
## Df Deviance AIC
## <none> 35814 35836
## - ast 1 35976 35996
## - pf 1 36310 36330
## - blk 1 36335 36355
## - ft_pct 1 36576 36596
## - fg3_pct 1 36932 36952
## - oreb 1 37957 37977
## - tov 1 38287 38307
## - stl 1 38869 38889
## - dreb 1 43635 43655
## - fg_pct 1 44015 44035
summary(logit.step)
##
## Call:
## glm(formula = wl ~ fg_pct + fg3_pct + ft_pct + oreb + dreb +
## ast + stl + blk + tov + pf, family = "binomial", data = train,
## control = list(maxit = 1000))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3302 -0.6406 0.0198 0.6309 3.2524
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -25.150274 0.282380 -89.06 <2e-16 ***
## fg_pct 30.300719 0.394694 76.77 <2e-16 ***
## fg3_pct 4.300476 0.132209 32.53 <2e-16 ***
## ft_pct 3.626952 0.133934 27.08 <2e-16 ***
## oreb 0.169106 0.003828 44.18 <2e-16 ***
## dreb 0.237543 0.003153 75.35 <2e-16 ***
## ast -0.040846 0.003216 -12.70 <2e-16 ***
## stl 0.263265 0.005102 51.60 <2e-16 ***
## blk 0.120397 0.005340 22.55 <2e-16 ***
## tov -0.177678 0.003778 -47.03 <2e-16 ***
## pf -0.067135 0.003050 -22.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 59676 on 43046 degrees of freedom
## Residual deviance: 35814 on 43036 degrees of freedom
## AIC: 35836
##
## Number of Fisher Scoring iterations: 5
For the logistic regression, we can interpret the variables from the perspective of odds.
All variables selected are significant in this logistic regression model.
For each additional 1 in field goal percentage, the odds of winning are multiplied by e^30.300719.
For each additional 1 in 3-point field goal percentage, the odds of winning are multiplied by e^4.300476.
For each additional 1 in free throw percentage, the odds of winning are multiplied by e^3.626952.
For each additional offensive rebound per game, the odds of winning are multiplied by e^0.169106.
For each additional defensive rebound per game, the odds of winning are multiplied by e^0.237543.
For each additional steal per game, the odds of winning are multiplied by e^0.263265.
For each additional assist per game, the odds of winning are multiplied by e^(-0.040846).
For each additional block per game, the odds of winning are multiplied by e^0.120397.
For each additional turnover per game, the odds of winning are multiplied by e^(-0.177678).
For each additional personal foul per game, the odds of winning are multiplied by e^(-0.067135).
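As a quick numeric illustration of the odds interpretation, take the steals coefficient reported above:

```r
# One extra steal multiplies the odds of winning by exp(beta).
beta_stl <- 0.263265
odds_multiplier <- exp(beta_stl)
round(odds_multiplier, 2)  # about 1.30, i.e. roughly a 30% increase in the odds
```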
probabilities <- lg_regre %>% predict(test, type = "response")
head(probabilities)
## 1 2 3 4 5 6
## 0.9599067 0.1557291 0.9642265 0.4626808 0.9066939 0.4378931
contrasts(test$wl)
## 1
## 0 0
## 1 1
predicted.classes <- ifelse(probabilities > 0.5, "1", "0")
head(predicted.classes)
## 1 2 3 4 5 6
## "1" "0" "1" "0" "1" "0"
mean(predicted.classes == test$wl)
## [1] 0.8053523
Using the logistic model with all the variables we selected, the prediction accuracy on the test set is 0.8053523.
Our final model for predicting the game result is shown below; the left-hand side is the log-odds of winning.
\[\log\frac{p_i}{1-p_i}=-25.150274 + 30.300719 (fg_pct)+4.300476(fg3_pct)+3.626952(ft_pct)+0.169106(oreb)+\\ 0.237543(dreb)-0.040846(ast)+0.263265(stl)+0.120397(blk)-0.177678(tov)-0.067135(pf)\]
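As a sketch of how this equation yields a win probability, we can plug a hypothetical stat line (values chosen purely for illustration) into the log-odds and apply the logistic transform:

```r
# Coefficients of the final logistic model, in the order of the equation above.
coefs <- c(-25.150274, 30.300719, 4.300476, 3.626952, 0.169106,
           0.237543, -0.040846, 0.263265, 0.120397, -0.177678, -0.067135)
# Hypothetical stat line: intercept, fg_pct, fg3_pct, ft_pct, oreb,
# dreb, ast, stl, blk, tov, pf.
x <- c(1, 0.48, 0.37, 0.78, 10, 33, 24, 8, 5, 12, 20)
log_odds <- sum(coefs * x)
plogis(log_odds)  # logistic transform gives the win probability, about 0.83
```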
We have built both linear and logistic regressions based on the NBA data. The adjusted R-squared for the linear regression model is 0.6916, so it explains the game score to a large extent. The prediction accuracy of the logistic regression model is 0.8053523, which helps us predict the result of a game fairly accurately.
Some variables contribute positively to winning a game in both models, such as field goal percentage, 3-point field goal percentage, and free throw percentage, which suggests the Knicks should pay more attention to these aspects in daily training.
One variable, turnovers, contributes negatively to winning in both models, which suggests the Knicks should avoid this action in games.